Wire-delay Reduction Analysis of a 3-tier, 8-point Fast Fourier Transform 3D-IC

Authors

  • W. Rhett Davis
  • Hao Hua
  • Ambarish Sule
  • Christopher Mineo
  • Samson Melamed
  • Michael Steer
  • Paul Franzon
Abstract

3D-ICs promise to reduce wire-delays, but non-idealities threaten to diminish the benefit. This paper presents an analysis of the performance improvement of a standard-cell implementation of an FFT when designed for the three-tier process from MIT Lincoln Labs. The methodology is presented, along with analyses of delay, routing congestion, and heat. The methodology uses commercial 2D CAD tools with very simple scripts to link them together, but still achieves a 15% reduction in average wire length and a 23% reduction in total power.

INTRODUCTION

As transistor feature sizes have shrunk, the resistance and capacitance of wires have affected circuit delays increasingly more than the driver resistance or the loading gate capacitance. Three-dimensional integrated circuits (3D-ICs) hold the promise of improving digital circuit performance by dramatically increasing transistor density and thereby shortening wires. Several studies have supported the hope of improved performance [3,4]. However, non-idealities such as large vertical pitch, wire-congestion, and heat have many researchers worried that 3D-ICs will never realize the promised performance improvement. With the recent availability of a three-tier 3D-IC process from MIT Lincoln Labs [1], there is an opportunity to study the effect of these non-idealities on a prototype circuit. The goal of this work was to reach a large prototype as fast as possible and quantify the performance benefit of the 3D-IC process.

How much improvement can we expect from a three-tier process? An upper bound on the improvement can be found by ignoring the inter-tier vias and assuming that wire-length is proportional to the side dimension of the chip. This simple approach leads to the notion that interconnect-dominated architectures should see the average wire-length shrink by a factor of at most √3, a reduction of 42%.
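The 42% upper bound can be checked with a few lines of arithmetic. The sketch below assumes, as the text does, that the total silicon area is fixed and split evenly across tiers, that wire length scales with the side dimension of a tier, and that inter-tier vias cost nothing. The function name is illustrative and not taken from the paper.

```python
import math


def ideal_wire_length_reduction(num_tiers: int) -> float:
    """Return the upper-bound fractional reduction in wire length.

    Assumptions (idealized, as in the text): total area is fixed and split
    evenly across tiers, wire length is proportional to the tier's side
    dimension, and inter-tier vias are ignored.
    """
    if num_tiers < 1:
        raise ValueError("num_tiers must be at least 1")
    # Side dimension scales as 1/sqrt(num_tiers) when area per tier is A/num_tiers.
    return 1.0 - 1.0 / math.sqrt(num_tiers)


if __name__ == "__main__":
    for tiers in (1, 2, 3, 4):
        reduction = ideal_wire_length_reduction(tiers)
        print(f"{tiers} tier(s): at most {reduction:.1%} shorter wires")
```

For three tiers the function evaluates to roughly 0.42, which matches the bound quoted above.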
Other researchers have performed more thorough investigations of the potential wire-length improvement. Zhang et al. [3] used stochastic estimates based on Rent’s rule that show that a three-tier process would see roughly a 40% reduction in the lengths of the longest wires but only a 30% reduction for the average wire. Das et al. [4] developed a 3D placer and global router and applied them to the ISPD ’98 benchmark circuits. When applied to a three-tier technology, their approach showed a reduction in average wire length of 11% when minimizing inter-tier cuts and 41% when minimizing wire-length. By all accounts, average wire-lengths should be significantly reduced.

On the downside, the inter-tier via in the MITLL 3D technology creates a column that consumes all routing tracks for the tier, which can worsen routing congestion problems. Also, the parasitic capacitance of the inter-tier via degrades the benefit of reduced wire-length. Lastly, the Silicon-on-Insulator (SOI) process has active-islands of silicon floating in glass, which worsens the heat dissipation problems.

Fig. 1 shows a summary of the technology used in this study. Note that the process has three metal-layers per tier, and the top two tiers (labeled “B” and “C”) are flipped relative to the bottom tier (labeled “A”).

We assembled a place-and-route methodology for this process and used it to prototype an 8-point, 32-bit, floating-point, parallel Fast-Fourier Transform (FFT) chip. The parallel FFT application was chosen because it typically requires long wires to connect the function units. In the remainder of this paper, we present the design methodology used to build the FFT and analyses of delay, routing congestion, and heat. For more complete details of the FFT design, we refer the reader to [2].
DESIGN METHODOLOGY

In order to reach a prototype as quickly as possible, we wanted to write a minimum of new tools and re-use existing two-dimensional (2D) place-and-route tools as much as possible. Two key innovations enabled this approach: tier-non-specific layers and standard-cell vias.

The idea behind tier-non-specific layers is to distinguish between layers that are specific to one tier (e.g. M1_A, M1_B, and M1_C) and those that are not (e.g. M1). By putting these layers in the same Cadence Virtuoso technology file, we were able to go back-and-forth between single-tier and multi-tier views of the design with simple scripts to transform layer names. Management of the layers was a difficult task, given that only 128 user-defined layers are available, meaning only 32 layers per tier (counting the three tiers plus the non-specific layers). This would not have been possible had there been more than three metal layers in each tier. However, we hope that this limitation will disappear when the OpenAccess version of Virtuoso is released later this year, which supports a much larger number of user-defined layers.

Once the technology file was completed, standard-cells were designed in the tier-non-specific layers, based on the IIT-SoC library from the Illinois Institute of Technology [6]. We found that two inter-tier vias fit conveniently in the space of one standard cell, as shown in Fig. 3. By splitting each via into two tier-specific standard-cells as shown, we were able to use 2D place-and-route tools without modification. The 3D place-and-route problem is therefore reduced to the following question: how do we optimally place the inter-tier via cells, and how do we align each via with its corresponding cell in another tier?

This approach led us to the design-flow illustrated in Fig. 4. The flow begins with a standard-cell Verilog netlist that has been grouped into modules that can be easily floorplanned (on the order of 100 modules of roughly the same area).
The partitioning tool k-METIS [7] was used to create a three-way partition, minimizing the number of cuts between the partitions. The three resulting tier-specific netlists could then be floorplanned and placed with Cadence SOC Encounter. After this step, all inter-tier vias are placed in each tier but not aligned between tiers. A custom script then aligns the vias, moving each one to the centroid of the terminals connected to its net. This approach is crude, because it makes no effort to minimize wire-length between tiers, but it does minimize potential routing congestion while still offering a reduction in wire-length. After the vias were aligned, the clock-tree was inserted, the design was routed, and parasitics were extracted in SOC Encounter. The single-tier DEF files were then imported into Virtuoso for physical verification and preparation for tape-out.

In order to analyze timing for the design as a whole, the tier-specific Verilog netlists (with clock-trees inserted) and SPEF (standard parasitic exchange format) files were merged into a single netlist and a single SPEF file for analysis with the static-timing analyzer PrimeTime from Synopsys. For complete details of the tool-flow used in this project, we refer the reader to [8].

The SPEF files generated by SOC Encounter offer a good estimate of wire-parasitics, but they offer no estimate of inter-tier via parasitics. To better understand the parasitics in this process, we ran simulations using the 3D field-solver Q3D from Ansoft and found that the capacitance varied widely depending on the number of adjacent wires. An isolated via has a ground capacitance of about 0.9 fF, while a fully-shielded via has a coupling capacitance of about 4.2 fF. Because the inter-tier vias are so wide, the worst-case resistance of 0.1 Ω is much smaller than the resistance of normal wires and vias. More complete details of the results are presented in [2].
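The via-alignment step described in the flow above (moving each inter-tier via to the centroid of the terminals on its net) can be sketched as follows. This is a minimal illustration and not the authors' script: the data layout, the function names, and the snapping of the result to a placement grid are all assumptions, and legalization against overlapping cells is not handled.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def snap_to_grid(value: float, pitch: float) -> float:
    """Snap a coordinate to the nearest multiple of the (assumed) grid pitch."""
    return round(value / pitch) * pitch


def align_vias(net_terminals: Dict[str, List[Point]], pitch: float = 1.0) -> Dict[str, Point]:
    """Return one shared (x, y) via location per inter-tier net.

    net_terminals maps a net name to the (x, y) positions of every terminal
    on that net, collected from all tiers in a common coordinate system.
    The via is placed at the centroid of those terminals, then snapped to
    the grid so that the via cells in each tier land on the same site.
    """
    aligned: Dict[str, Point] = {}
    for net, terminals in net_terminals.items():
        if not terminals:
            raise ValueError(f"net {net!r} has no terminals")
        centroid_x = sum(x for x, _ in terminals) / len(terminals)
        centroid_y = sum(y for _, y in terminals) / len(terminals)
        aligned[net] = (snap_to_grid(centroid_x, pitch), snap_to_grid(centroid_y, pitch))
    return aligned


if __name__ == "__main__":
    # Hypothetical net with three terminals spread over two tiers.
    example = {"net_a": [(0.0, 0.0), (4.0, 0.0), (2.0, 6.0)]}
    print(align_vias(example, pitch=1.0))
```

Because every tier receives the same coordinate for a given net, the tier-specific via cells line up vertically, which is the property the 2D tools cannot enforce on their own.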
For the purpose of worst-case timing analysis, our SPEF-merging script inserted a pi-model for each via with the maximum values of capacitance and resistance.

DELAY ANALYSIS

In order to quantify the improvement of the 3D process, we re-designed the FFT in a single tier and analyzed the delays in both approaches. Fig. 5 shows the results. The average wire length dropped by 17% with three tiers, while the longest wire length was reduced by 41%. These findings support the prediction in [3] that the longest wires would benefit more than the average wire, as well as the finding in [4] that the savings would be less when minimizing the number of inter-tier cuts.

Although wire-lengths were reduced, path delays were not significantly reduced. The use of three tiers provided a speed-up of only 2.4%, which means that this design was still limited by gate-delay, rather than wire-delay. The gate delay in this design was about 60 ns, which means that wire-delay was improved by roughly 10%.

The real win of the 3D process appears to be total power, which was reduced by 23%. This reduction is a combination of the effects of reduced wire capacitance, reduced clock-power, and reduced short-circuit power (since this design did not use repeaters). These findings imply that the most easily achievable benefit from 3D integration may be a reduction in power, rather than an increase in speed.

CONGESTION AND HEAT ANALYSES

The findings from the FFT experiment suggest that greater improvements in wire-delays could be realized if we were to minimize wire-lengths rather than inter-tier cuts. However, the inter-tier cuts have an adverse effect on routing congestion, because each via consumes all routing tracks on a tier. To study this effect, we increased the number of inter-tier vias in the power grid and ran trial-routes. Fig. 6 shows histograms of the density of occupied tracks in a G-Cell, where each G-Cell has 10x10 routing tracks and 100% indicates that all tracks are used.
A density greater than 100% indicates that there were not enough routing resources. The figures show that for 1250 vias/mm², the design was still easily routable, but routability dropped sharply for larger numbers of vias. Interestingly, tier-C was the least congested and was still routable with 2500 vias/mm². This is perhaps due to the fact that the off-chip connections were on tier-C, and there were therefore many more lateral wires than vertical wires.

Heat tends to degrade the performance of digital circuits, and so it is important to analyze the heat of a 3D-IC to ensure that any performance gains are not lost. To analyze this heat, we can follow the approach described by Rahman and Reif [5], in which a vertical thermal-conductivity value is found for each tier, and the power of that tier is assumed to be converted to heat, as shown in Fig. 2. This approach yields an upper bound on the thermal conductivity (and a lower bound on the junction-temperature), because it assumes perfect heat spreading. A lower bound on conductivity that assumes no heat spreading is also possible, but has the down-side that each transistor must be analyzed separately. Here we restrict our analysis to the upper bound.

To compute the upper bound of thermal conductivity, we assumed that all metal layers were fully populated with metal and found the parallel resistance for each layer, and then summed the resistances of each layer in series to find the aggregate resistance for each tier. The conductivity varied most significantly depending on the inter-tier via density, as shown in Fig. 7. Note that the maximum number of vias allowed by the MITLL design-rules is 17,500 vias/mm² (which would leave no room for transistors), but that the FFT design became unroutable beyond 2500 vias/mm². Given the FFT chip area of 2.1 mm x 2.1 mm, we calculated the total thermal resistance across the tiers (θtierC + θtierAB in Fig. 1) to be around 1 K/W.
Plugging in the total chip power of 164 mW, we predict the total temperature difference between the tiers to be less than 1 degree. A higher power density will yield a greater temperature gradient, however, and Fig. 7 shows that the addition of “thermal vias” will have a very limited effect on increasing thermal conductivity and reducing temperature. If other designs are also limited to roughly 2500 vias/mm² for routability, then the insertion of vias can increase thermal conductivity by only 60% in this process.
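The thermal estimate above reduces to a parallel-then-series resistance calculation followed by a single multiplication. The sketch below illustrates that structure under stated assumptions: the helper names are hypothetical, the two-layer example tier uses made-up resistances purely to show the combination rule, and only the 164 mW power and the roughly 1 K/W total resistance come from the text.

```python
from typing import Iterable, List


def parallel(resistances: Iterable[float]) -> float:
    """Combine thermal resistances (K/W) of side-by-side heat paths in one layer."""
    return 1.0 / sum(1.0 / r for r in resistances)


def series(resistances: Iterable[float]) -> float:
    """Combine thermal resistances (K/W) of layers stacked on top of each other."""
    return sum(resistances)


def tier_resistance(layers: List[List[float]]) -> float:
    """Aggregate resistance of a tier: parallel paths within a layer, layers in series."""
    return series(parallel(paths) for paths in layers)


def temperature_rise(power_w: float, theta_k_per_w: float) -> float:
    """Temperature difference (K) when power_w flows through resistance theta_k_per_w."""
    return power_w * theta_k_per_w


# Hypothetical tier: two layers, each with two parallel heat paths (values are made up).
example_tier = [[4.0, 4.0], [1.0, 1.0]]
example_theta = tier_resistance(example_tier)  # 2.0 K/W + 0.5 K/W

# Figures quoted in the text: about 1 K/W across the tiers and 164 mW total power.
delta_t = temperature_rise(power_w=0.164, theta_k_per_w=1.0)

if __name__ == "__main__":
    print(f"example tier resistance: {example_theta:.2f} K/W")
    print(f"predicted temperature difference: {delta_t:.3f} K")
```

With the quoted figures the predicted difference is well under 1 degree, consistent with the conclusion in the text. Treating the whole chip power as flowing through the full stack is itself a simplification made for this illustration.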

Similar resources

AMBARISH MUKUND . Design of Pipeline Fast Fourier Transform Processors using 3 Dimensional Integrated Circuit

SULE, AMBARISH MUKUND. Design of Pipeline Fast Fourier Transform Processors using 3 Dimensional Integrated Circuit Technology. (Under the direction of Prof. W. Rhett Davis). Fast Fourier Transform (FFT) processing is an important component of many Digital Signal Processing (DSP) applications and communication systems. We focus on applications requiring large-point FFTs (> 1024), and where high-...

Design and Simulation of 32-Point FFT Using Mixed Radix Algorithm for FPGA Implementation

This paper focuses on the development of the fast Fourier transform (FFT), based on Decimation-In-Time (DIT) domain by using Mixed-Radix algorithm (Radix-4 and Radix-8). Fast Fourier transforms, popularly known as FFTs, have become an integral part of any type of digital communication system and a wide variety of approaches have been tried in order to optimize the algorithm for a variety of parame...

Hardware implementation low power high speed FFT core

In recent times, DSP algorithms have received increased attention due to rapid advancements in multimedia computing and high-speed wired and wireless communications. In response to these advances, the search for novel implementations of arithmetic-intensive circuitry has intensified. For the portability requirement in telecommunication systems, there is a need for low power hardware implementat...

Fused Floating Point Arithmetic Unit for Radix 2 FFT Implementation

This paper describes the design and implementation of user defined fused floating-point arithmetic operations that can be used to implement Radix 2 butterfly Fast Fourier Transform (FFT) for complex numbers used in Digital Signal Processing (DSPC) processors. This paper reports the comparison of area, delay and power of fused floating point modules as compared to discrete floating point with re...

Pathologies cardiac discrimination using the Fast Fourir Transform (FFT) The short time Fourier transforms (STFT) and the Wigner distribution (WD)

This paper is concerned with a synthesis study of the fast Fourier transform (FFT), the short time Fourier transform (STFT), and the Wigner distribution (WD) in analysing the phonocardiogram signal (PCG) or heart cardiac sounds. The FFT (Fast Fourier Transform) can provide a basic understanding of the frequency contents of the heart sounds. The STFT is obtained by calculating the Fourier tran...

Publication date: 2006